@clayborg
Collaborator

Mach-O has 32-bit file offsets in the MachO::section_64 structs. dSYM files can contain sections whose start offset exceeds UINT32_MAX, in which case MachO::section_64.offset gets truncated. We can detect when this happens and adjust the section offset so that it is 64-bit safe. Tools then get the correct section contents for large dSYM files, and tools that parse DWARF, like llvm-gsymutil, can load and convert these files correctly.
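
To make the truncation concrete, here is a minimal sketch of the arithmetic involved. The helper names `storedOffset` and `rebuildOffset` are hypothetical and are not part of LLVM; the sketch only assumes that the header keeps the low 32 bits of the real file offset and that the number of 4 GiB boundaries before the section is known from elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: what a 32-bit offset field keeps of a 64-bit file
// offset. Bits 32..63 are silently dropped.
inline uint32_t storedOffset(uint64_t RealFileOffset) {
  return static_cast<uint32_t>(RealFileOffset);
}

// Hypothetical helper: rebuild the 64-bit offset from the stored low bits
// plus the number of 4 GiB boundaries known to precede the section.
inline uint64_t rebuildOffset(uint32_t Stored, uint64_t NumWraps) {
  assert(NumWraps <= UINT32_MAX && "wrap count must fit in the upper 32 bits");
  return (NumWraps << 32) | Stored;
}
```

A section that really starts at 0x100001000 is stored as 0x1000, and only the knowledge that one 4 GiB boundary was crossed lets a reader recover the original value.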

@llvmbot
Member

llvmbot commented Oct 31, 2025

@llvm/pr-subscribers-llvm-binary-utilities

Author: Greg Clayton (clayborg)



Full diff: https://github.com/llvm/llvm-project/pull/165940.diff

2 Files Affected:

  • (modified) llvm/include/llvm/Object/MachO.h (+1-1)
  • (modified) llvm/lib/Object/MachOObjectFile.cpp (+16-2)
diff --git a/llvm/include/llvm/Object/MachO.h b/llvm/include/llvm/Object/MachO.h
index 01e7c6b07dd36..f4c1e30b097ee 100644
--- a/llvm/include/llvm/Object/MachO.h
+++ b/llvm/include/llvm/Object/MachO.h
@@ -447,7 +447,7 @@ class LLVM_ABI MachOObjectFile : public ObjectFile {
   uint64_t getSectionAddress(DataRefImpl Sec) const override;
   uint64_t getSectionIndex(DataRefImpl Sec) const override;
   uint64_t getSectionSize(DataRefImpl Sec) const override;
-  ArrayRef<uint8_t> getSectionContents(uint32_t Offset, uint64_t Size) const;
+  ArrayRef<uint8_t> getSectionContents(uint64_t Offset, uint64_t Size) const;
   Expected<ArrayRef<uint8_t>>
   getSectionContents(DataRefImpl Sec) const override;
   uint64_t getSectionAlignment(DataRefImpl Sec) const override;
diff --git a/llvm/lib/Object/MachOObjectFile.cpp b/llvm/lib/Object/MachOObjectFile.cpp
index e09dc947c2779..300a5f7ed2a48 100644
--- a/llvm/lib/Object/MachOObjectFile.cpp
+++ b/llvm/lib/Object/MachOObjectFile.cpp
@@ -1978,20 +1978,34 @@ uint64_t MachOObjectFile::getSectionSize(DataRefImpl Sec) const {
   return SectSize;
 }
 
-ArrayRef<uint8_t> MachOObjectFile::getSectionContents(uint32_t Offset,
+ArrayRef<uint8_t> MachOObjectFile::getSectionContents(uint64_t Offset,
                                                       uint64_t Size) const {
   return arrayRefFromStringRef(getData().substr(Offset, Size));
 }
 
 Expected<ArrayRef<uint8_t>>
 MachOObjectFile::getSectionContents(DataRefImpl Sec) const {
-  uint32_t Offset;
+  uint64_t Offset;
   uint64_t Size;
 
   if (is64Bit()) {
     MachO::section_64 Sect = getSection64(Sec);
     Offset = Sect.offset;
     Size = Sect.size;
+    // Check for large mach-o files where the section contents might exceed
+    // 4GB. MachO::section_64 objects only have 32 bit file offsets to the
+    // section contents and can overflow in dSYM files. We can track this and
+    // adjust the section offset to be 64 bit safe.
+    uint64_t SectOffsetAdjust = 0;
+    for (uint32_t SectIdx=0; SectIdx<Sec.d.a; ++SectIdx) {
+      MachO::section_64 CurrSect =
+          getStruct<MachO::section_64>(*this, Sections[SectIdx]);
+      const uint64_t EndSectFileOffset =
+          (uint64_t)CurrSect.offset + CurrSect.size;
+      if (EndSectFileOffset >= UINT32_MAX)
+        SectOffsetAdjust += EndSectFileOffset & 0xFFFFFFFF00000000ull;
+    }
+    Offset += SectOffsetAdjust;
   } else {
     MachO::section Sect = getSection(Sec);
     Offset = Sect.offset;
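
The adjustment loop in the hunk above can be tried out in isolation. The following is a sketch only: `Section64` is a reduced, hypothetical stand-in for MachO::section_64 with just the two fields the loop reads, and `adjustedOffset` is a made-up free function that mirrors the logic of this first revision of the diff rather than the code that was eventually merged.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, reduced stand-in for MachO::section_64.
struct Section64 {
  uint32_t offset; // file offset as stored in the header (low 32 bits only)
  uint64_t size;   // section size, already 64-bit in the header
};

// Mirrors the loop from the diff: accumulate the upper 32 bits contributed by
// every section that precedes SecIdx, then add them to the stored offset.
uint64_t adjustedOffset(const std::vector<Section64> &Sections,
                        uint32_t SecIdx) {
  uint64_t SectOffsetAdjust = 0;
  for (uint32_t I = 0; I < SecIdx; ++I) {
    const uint64_t EndSectFileOffset =
        static_cast<uint64_t>(Sections[I].offset) + Sections[I].size;
    if (EndSectFileOffset >= UINT32_MAX)
      SectOffsetAdjust += EndSectFileOffset & 0xFFFFFFFF00000000ull;
  }
  return Sections[SecIdx].offset + SectOffsetAdjust;
}
```

For example, if the first section starts at 0x1000 and is 5 GiB long, the next section really starts at 0x140001000 but is stored as 0x40001000; the loop adds back 0x100000000 for it and for every section after it.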

@github-actions

github-actions bot commented Oct 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

We now return an error if a section file offset exceeds 4GB and the sections are not ordered in the Mach-O file. If the sections are not ordered, we can't assume that a file offset overflow in one section applies to the sections that follow it; if they are ordered, we can.
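
One way to picture the ordering requirement is the sketch below. It is not the merged implementation: `SectInfo`, `reconstructOffsets`, and the use of the file size as the consistency check are all assumptions made for illustration, and the sketch further assumes that every listed section occupies file data. Files of at most 4 GiB are passed through unchanged, since no offset in them can be truncated. For larger files, each section is placed at the smallest offset at or after the end of the previous one whose low 32 bits match the stored value, and the function gives up if that placement runs past the end of the file.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified section description (not MachO::section_64).
struct SectInfo {
  uint32_t Offset; // truncated 32-bit file offset from the header
  uint64_t Size;   // 64-bit section size
};

// Returns 64-bit file offsets for all sections, or std::nullopt when the
// stored offsets cannot be explained by an in-order layout within FileSize.
std::optional<std::vector<uint64_t>>
reconstructOffsets(const std::vector<SectInfo> &Sects, uint64_t FileSize) {
  std::vector<uint64_t> Result;
  Result.reserve(Sects.size());
  // Small files cannot contain truncated offsets, so any order is fine.
  if (FileSize <= UINT32_MAX) {
    for (const SectInfo &S : Sects)
      Result.push_back(S.Offset);
    return Result;
  }
  uint64_t PrevEnd = 0; // reconstructed end of the previous section
  for (const SectInfo &S : Sects) {
    // Reuse the upper 32 bits of the previous end, then bump by 4 GiB if the
    // result would land before the previous section's end.
    uint64_t Full = (PrevEnd & 0xFFFFFFFF00000000ull) | S.Offset;
    if (Full < PrevEnd)
      Full += 1ull << 32;
    // An in-order layout must still fit inside the file.
    if (Full > FileSize || S.Size > FileSize - Full)
      return std::nullopt;
    Result.push_back(Full);
    PrevEnd = Full + S.Size;
  }
  return Result;
}
```

With out-of-order sections in a large file, the 4 GiB bump is applied to a section that did not actually wrap, which is why such input has to be rejected rather than guessed at.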
@clayborg clayborg force-pushed the llvm-macho-section-offset-64 branch from 457d287 to 350328a Compare November 3, 2025 00:38
@ellishg
Contributor

ellishg commented Nov 3, 2025

Thanks for fixing! Can we add a test?

@DataCorrupted
Member

Just tested locally: an 8GB dSYM was consumed successfully by llvm-dwarfdump and gsymutil. I don't have any out-of-order, larger-than-4GB dSYM to test the error handling, and I think that's a fairly rare case. As for checking a test file into git, I think our experience in lldb (#164471) told us that it's just hard to do.

@clayborg
Collaborator Author

clayborg commented Nov 3, 2025

We definitely can't commit a huge binary, and obj2yaml and yaml2obj won't create invalid object files (where the segments would have the numbers we need without the file containing the data), nor do we want them to emit 4 to 8 GB binaries to disk just so we can test them. I also didn't see a Test.cpp variant for llvm/lib/Object/MachOObjectFile.cpp anywhere. Also, if you create a Mach-O file with just some LC_SEGMENT_64 load commands, but the data for the file isn't in the file, I believe it will fail to load and emit an error. So this is hard to test. I am open to ideas if anyone has any, but I can't find any acceptable solutions.

@DataCorrupted
Member

LGTM.

I wonder if we can relax the Mach-O standard by saying that "32-bit section offsets can exceed 4G but must be ordered for 64-bit offset reconstruction". That way a small change (like this PR) is enough to alleviate the hard restriction.

@clayborg clayborg merged commit 6601c38 into llvm:main Nov 4, 2025
10 checks passed
@clayborg clayborg deleted the llvm-macho-section-offset-64 branch November 4, 2025 17:36